Data Tidying

Daniela Palleschi

Humboldt-Universität zu Berlin

2023-12-13

Learning objectives

Today we will learn…

  • about wide versus long data
  • how to make wide data longer
  • how to make long data wider

Resources

The suggested resources for this topic are

Set-up

Packages

pacman::p_load(tidyverse,
               here,
               janitor)

Data

  • we’ll use the languageR_english.csv dataset (in the daten folder)
df_eng <- read_csv(here("daten", "languageR_english.csv")) |> 
1  clean_names() |>
2  arrange(word) |>
3  rename(
4    rt_lexdec = r_tlexdec,
5    rt_naming = r_tnaming
  ) |> 
6  select(age_subject, word, word_category, rt_lexdec, rt_naming)
1
Clean (i.e., tidy) variable names (from janitor)
2
Arrange rows by word in ascending order (A-Z)
3
Rename variables…
4
r_tlexdec to rt_lexdec
5
r_tnaming to rt_naming
6
keep only the columns named

Tidy workflow

  • Abbildung 1 shows an overview of the typical data science process
    • import our data, tidy it, then work through a cycle of transforming, visualising, and modelling before finally communicating our findings
  • today we’ll learn about the tidying step

Abbildung 1: A model of the data science process from Wickham et al. (2023) (all rights reserved)

  • so far we’ve learned how to
    • import our data (readr::read_csv)
    • transform data (dplyr package)
    • visualise data (ggplot2 package)
    • communicate our findings with dynamic reports (Quarto)
  • but we’ve only seen tidy data so far
    • so we haven’t needed to perform the ‘tidy’ step (tidyr package)

Tidy data

  • the same data can be represented in multiple ways
  • we’re going to look at 3 tables that present the exact same data in different formats
  • the tables show the same values of four variables:
    • country
    • year
    • population
    • and number of tuberculosis cases
  • Each dataset organises the values differently
  • As we look at each table, take a moment to consider how easy it is to understand the values and relationships between the variables

Tabelle 1: Tabelle 1
country year cases population
Afghanistan 1999 745 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583
Tabelle 2: Tabelle 2
country year type count
Afghanistan 1999 cases 745
Afghanistan 1999 population 19987071
Afghanistan 2000 cases 2666
Afghanistan 2000 population 20595360
Brazil 1999 cases 37737
Brazil 1999 population 172006362
Brazil 2000 cases 80488
Brazil 2000 population 174504898
China 1999 cases 212258
China 1999 population 1272915272
China 2000 cases 213766
China 2000 population 1280428583
Tabelle 3: Tabelle 3
country year rate
Afghanistan 1999 745/19987071
Afghanistan 2000 2666/20595360
Brazil 1999 37737/172006362
Brazil 2000 80488/174504898
China 1999 212258/1272915272
China 2000 213766/1280428583
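
These three tables appear to be the table1, table2, and table3 example datasets that ship with the tidyr package (that attribution is my assumption; the values match), so you can inspect them yourself once tidyverse is loaded:

```r
# the example datasets are exported by tidyr (attached via tidyverse)
library(tidyr)

table1  # tidy: one column per variable, one row per observation
table2  # the "count" column mixes values of two variables (cases, population)
table3  # the "rate" column stores two values in a single cell
```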

Rules for tidy data

  • You likely found Tabelle 1 easiest to read
    • it follows the three rules for tidy data (visualised in Abbildung 2):
  1. Each variable is a column, each column is a variable
  2. Each observation is a row, each row is an observation
  3. Each value is a cell, each cell is a single value

Abbildung 2: Image source: Wickham et al. (2023) (all rights reserved)

Why tidy data?

“Happy families are all alike; every unhappy family is unhappy in its own way.”

— Leo Tolstoy

“Tidy datasets are all alike, but every untidy dataset is untidy in its own way.”

— Hadley Wickham

There are two main advantages to working with tidy data:

  1. working with a consistent data structure allows us to adopt conventions
    • tidy data is the generally agreed upon data structure
    • conventions/tools are built on the assumption of this structure
  2. R’s vectorised nature can shine
    • most built-in R functions work with vector values (and columns are essentially vectors)
    • all packages in the tidyverse are designed to work with tidy data (e.g., ggplot2 and dplyr)

Aufgabe 1: Tidy data

Beispiel 1  

  1. Go back to Tables 1-3. For each table, describe what each observation and each column represents.
  2. Sketch out the process you’d use to calculate the rate for Tabelle 1. You will need just one verb that:
    • creates a new variable (call it rate) that contains:
      • the number of TB cases per country per year, divided by
      • the matching population per country per year,
      • multiplied by 10,000
    • hint: Which dplyr verb creates new variables? (Look back at ?@sec-data-transform.)
  3. Look at tables 2 and 3. Would it have been as easy to calculate rate with these data structures?

Data tidying

  • transforming wide data to long data and long data to wide data (among other steps)
    • outcome: tidy data (usually)

Data tidying with the tidyverse

  • tidyr package (from the tidyverse) has two useful functions for transposing our data:

    • pivot_longer(): make wide data longer
    • pivot_wider(): make long data wider

Abbildung 3: the most famous use of the word pivot (at least for millennials) (Friends)

Wide versus long data

  • we often need to convert between wide and long data formats to do different types of summaries or visualisation

  • wide data: all of the observations about one thing are in the same row

    • is usually not tidy
  • long data: each observation is on a separate row

    • is usually tidy
  • let’s start with the most typical case: turning wide data into long data

pivot_longer()

  • in the languageR_english.csv dataset (df_eng)
    • we have 4568 observations (rows)
    • we have 5 variables (columns)
    • the column age_subject tells us whether an observation is from a participant in the ‘old’ or ‘young’ age group
    • the columns word and word_category describe properties of the stimulus for a given observation (i.e., the word)
    • the column rt_lexdec contains the response time for a lexical decision task
    • the column rt_naming contains the response time for a word naming task

head(df_eng)

Tabelle 4: df_eng
age_subject word word_category rt_lexdec rt_naming
young ace N 623.61 456.3
old ace N 775.67 607.8
young act V 617.10 445.8
old act V 715.52 639.7
young add V 575.70 467.8
old add V 742.19 605.4
  • is this data in Tabelle 4 tidy?
  • is this data too wide or too long?
  • how might we make this data longer?

Our goal

Abbildung 4: Our plot to be reproduced

  • the two continuous variables rt_lexdec and rt_naming appear in facets
    • facet_wrap() takes a categorical variable as its argument, and produces a facet of each category
  • so we need a categorical variable which contains the levels lexdec and naming
    • and a continuous variable that contains the corresponding response time

pivot_longer()

  • tidyr function pivot_longer() converts a wide data table to a longer format
    • converts the names of specified columns into the values of a new categorical column
    • and combines the values of those columns into a new column
1df_eng_long <-
2  df_eng %>%
3  pivot_longer(
4    cols = starts_with("rt_"),
5    names_to = "response",
6    values_to = "time"
  )
1
Create a new object called df_eng_long, which contains…
2
df_eng, and then
3
make it longer
4
by taking columns that start with rt_
5
and create a variable called response that contains the names from cols
6
and create a variable called time that contains the values from cols

df_eng_long |> head()
# A tibble: 6 × 5
  age_subject word  word_category response   time
  <chr>       <chr> <chr>         <chr>     <dbl>
1 young       ace   N             rt_lexdec  624.
2 young       ace   N             rt_naming  456.
3 old         ace   N             rt_lexdec  776.
4 old         ace   N             rt_naming  608.
5 young       act   V             rt_lexdec  617.
6 young       act   V             rt_naming  446.

  • let’s compare the observations for the words ace and act in
Tabelle 5: df_eng
age_subject word rt_lexdec rt_naming
young ace 623.61 456.3
old ace 775.67 607.8
young act 617.10 445.8
old act 715.52 639.7
Tabelle 6: df_eng |> pivot_longer(...)
age_subject word response time
young ace rt_lexdec 623.61
young ace rt_naming 456.30
old ace rt_lexdec 775.67
old ace rt_naming 607.80
young act rt_lexdec 617.10
young act rt_naming 445.80
old act rt_lexdec 715.52
old act rt_naming 639.70

  • the two tables contain the exact same information
    • 8 response time values:
      • 4 for rt_lexdec
      • 4 for rt_naming
  • this is an important realisation: we haven’t changed any data or observation values; we’ve simply restructured the organisation of the data points
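
A minimal sketch can confirm this, using a toy tibble that mimics Tabelle 5 (the tribble() values below are copied from that table): pivoting doubles the number of rows but leaves the set of response-time values untouched.

```r
library(tidyverse)

# toy version of Tabelle 5
df_toy <- tribble(
  ~age_subject, ~word, ~rt_lexdec, ~rt_naming,
  "young", "ace", 623.61, 456.3,
  "old",   "ace", 775.67, 607.8,
  "young", "act", 617.10, 445.8,
  "old",   "act", 715.52, 639.7
)

df_toy_long <- df_toy |>
  pivot_longer(
    cols = starts_with("rt_"),
    names_to = "response",
    values_to = "time"
  )

# twice as many rows...
nrow(df_toy_long)  # 8
# ...but exactly the same 8 response-time values
setequal(df_toy_long$time, c(df_toy$rt_lexdec, df_toy$rt_naming))  # TRUE
```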

Plotting our tidy data

  • now let’s try to create our plot:
    • age_subject on the x-axis
    • time on the y-axis
    • response categories in facets

Abbildung 5: Response times per age group for the lexical decision task vs. naming task

Aufgabe 1: Tidy data

Beispiel 2  

Recreate Abbildung 5

pivot_wider()

  • it’s more common to want to lengthen your data (take columns and turn their values into new rows)
    • but sometimes we also want to widen our data (take rows and turn their values into new columns)
  • tidyr function pivot_wider() makes datasets wider by increasing columns and reducing rows
    • this helps when one observation is spread across multiple rows
  • let’s try to make df_eng wider
    • for example, we could have a single row per word
      • with a single variable for the young subject’s response and the old subject’s response

pivot_wider()

  • pivot_wider() takes similar arguments to pivot_longer(), with some slight differences:

    • id_cols (optional): identifying columns (which columns uniquely identify each observation?)
    • names_from: which (categorical) column’s values should supply the new column names?
    • names_prefix (optional): a prefix to add to the beginning of the new column names
    • values_from: which column’s values should fill the new columns?

  • let’s create two new variables that take their names from age_subject, and their values from rt_lexdec. The result should look like Tabelle 8.
df_eng_wide <-
  df_eng %>%  
  select(-rt_naming) |> 
  pivot_wider(
1    names_from = age_subject,
2    values_from = rt_lexdec,
3    names_prefix = "lexdec_"
  )
1
create new column names using the values in age_subject
2
create new observation values from rt_lexdec
3
add lexdec_ to the beginning of the new column names

  • let’s compare the observations for the words ace and act in
Tabelle 7: df_eng
age_subject word word_category rt_lexdec
young ace N 623.61
old ace N 775.67
young act V 617.10
old act V 715.52
Tabelle 8: df_eng_wider
word word_category lexdec_young lexdec_old
ace N 623.61 775.67
act V 617.10 715.52
  • again, we haven’t changed any data or observation values; we’ve simply restructured the organisation of the data points
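
One way to convince yourself of this is a round trip: widening and then lengthening again recovers the original values. A sketch with a toy tibble (values copied from Tabelle 7):

```r
library(tidyverse)

# toy version of Tabelle 7 (lexical decision times only)
df_toy <- tribble(
  ~age_subject, ~word, ~rt_lexdec,
  "young", "ace", 623.61,
  "old",   "ace", 775.67,
  "young", "act", 617.10,
  "old",   "act", 715.52
)

df_toy_wide <- df_toy |>
  pivot_wider(
    names_from   = age_subject,
    values_from  = rt_lexdec,
    names_prefix = "lexdec_"
  )

# round trip: lengthen again, stripping the "lexdec_" prefix
# so the age labels "young"/"old" come back
df_toy_back <- df_toy_wide |>
  pivot_longer(
    cols         = starts_with("lexdec_"),
    names_to     = "age_subject",
    names_prefix = "lexdec_",
    values_to    = "rt_lexdec"
  )

setequal(df_toy_back$rt_lexdec, df_toy$rt_lexdec)  # TRUE
```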

Unique values

We removed rt_naming because it, too, has a unique value per word per age group. If we don’t remove it, the dataset keeps the same length (still one row per word per age group); we only change the width, introducing NA values for lexdec_young in old subjects’ rows and for lexdec_old in young subjects’ rows. If we hadn’t removed it, our first rows would’ve looked like Tabelle 9. Compare this to the output in Tabelle 8: do you see the difference?

Tabelle 9: Wider df_eng data with NA’s
word word_category rt_naming lexdec_young lexdec_old
ace N 456.3 623.61 NA
ace N 607.8 NA 775.67
act V 445.8 617.10 NA
act V 639.7 NA 715.52
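
A toy sketch of the same effect (values copied from the ace rows of Tabelle 5): because rt_naming still differs between the two rows, pivot_wider() cannot collapse them, and we get NAs instead.

```r
library(tidyverse)

# two observations of the same word, one per age group
df_toy <- tribble(
  ~age_subject, ~word, ~rt_lexdec, ~rt_naming,
  "young", "ace", 623.61, 456.3,
  "old",   "ace", 775.67, 607.8
)

# rt_naming acts as an id column and keeps the rows apart,
# so we still get 2 rows, now with NAs in the new columns
df_toy |>
  pivot_wider(
    names_from   = age_subject,
    values_from  = rt_lexdec,
    names_prefix = "lexdec_"
  )
```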

Learning objectives 🏁

Today we learned…

  • about wide versus long data ✅
  • how to make wide data longer ✅
  • how to make long data wider ✅

Homework

We’ll stick with the df_eng dataset for these tasks.

  1. Use pivot_wider() to create two new variables, naming_old and naming_young, which contain the naming response times for old and young participants, respectively. Hint: you will need to remove rt_lexdec. The resulting data frame should have 2284 observations and 4 variables.

  2. Recreate Abbildung 6. Hint: you will need pivot_wider().

Abbildung 6: Scatterplot of lexical decision task response times per word for old versus young participants

  3. Why do we need our df_eng_wide dataset to produce Abbildung 6? In other words, why is df_eng_wide the appropriate structure for such a scatterplot, but not df_eng_long?

  4. Using df_eng_long and the summarise() function we saw in the last section, reproduce the summary below:
# A tibble: 2 × 3
  response   mean    sd
  <chr>     <dbl> <dbl>
1 rt_lexdec  708.  115.
2 rt_naming  566.  101.

Hint: do you need to remove NAs (we saw how to do this in the last section)?

Session Info

Produced with R version 4.3.0 (2023-04-21) (Already Tomorrow) and RStudio version 2023.9.0.463 (Desert Sunflower).

sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] janitor_2.2.0   here_1.0.1      lubridate_1.9.2 forcats_1.0.0  
 [5] stringr_1.5.0   dplyr_1.1.3     purrr_1.0.2     readr_2.1.4    
 [9] tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.3   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] utf8_1.2.3        generics_0.1.3    xml2_1.3.4        stringi_1.7.12   
 [5] hms_1.1.3         digest_0.6.33     magrittr_2.0.3    evaluate_0.21    
 [9] grid_4.3.0        timechange_0.2.0  fastmap_1.1.1     rprojroot_2.0.3  
[13] jsonlite_1.8.7    httr_1.4.6        rvest_1.0.3       fansi_1.0.4      
[17] viridisLite_0.4.2 scales_1.2.1      cli_3.6.1         rlang_1.1.1      
[21] crayon_1.5.2      bit64_4.0.5       munsell_0.5.0     withr_2.5.0      
[25] yaml_2.3.7        tools_4.3.0       parallel_4.3.0    tzdb_0.4.0       
[29] colorspace_2.1-0  webshot_0.5.4     pacman_0.5.1      kableExtra_1.3.4 
[33] png_0.1-8         vctrs_0.6.3       R6_2.5.1          magick_2.7.4     
[37] lifecycle_1.0.3   snakecase_0.11.0  bit_4.0.5         vroom_1.6.3      
[41] pkgconfig_2.0.3   pillar_1.9.0      gtable_0.3.4      Rcpp_1.0.11      
[45] glue_1.6.2        systemfonts_1.0.4 highr_0.10        xfun_0.39        
[49] tidyselect_1.2.0  rstudioapi_0.14   knitr_1.44        farver_2.1.1     
[53] htmltools_0.5.5   labeling_0.4.3    svglite_2.1.1     rmarkdown_2.22   
[57] compiler_4.3.0   

References

Nordmann, E., & DeBruine, L. (2022). Applied Data Skills. Zenodo. https://doi.org/10.5281/zenodo.6365078
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science (2nd ed.).